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Abstract 

Markov chain Monte Carlo methods explicitly defined on the manifold of 
probability distributions have recently been established. These methods 
are constructed from diffusions across the manifold and the solution of 
the equations describing geodesic flows in the Hamilton-Jacobi represen- 
tation. This paper takes the differential geometric basis of Markov chain 
Monte Carlo further by considering methods to simulate from probabil- 
ity distributions that themselves are defined on a manifold, with common 
examples being classes of distributions describing directional statistics. 
Proposal mechanisms are developed based on the geodesic flows over the 
manifolds of support for the distributions and illustrative examples are 
provided for the hypersphere and Stiefel manifold of orthonormal matri- 
ces. 

Keywords: Hamiltonian Monte Caro, geodesic, Riemannian manifold, 
directional statistics, Stiefel manifold. 



1 Introduction 

Markov chain Monte Carlo (MCMC) methods which originated in the physics 
literature have caused a revolution in statistical methodology over the last 20 
years by providing the means, now in an almost routine manner, to perform 
Bayesian inference over arbitrary non-conjugate prior and posterior pairs of 



distributions (Gilks, Richardson, and Spiegelhalter 119961 ) . 

A specific class of MCMC methods, originally known as Hybrid Monte Carlo 
(HMC), was develope d to m ore efficiently simulate quantum chromodynamic 



systems (Duane et al. 119871) . HMC goes beyond the random walk Metropolis 
or Gibbs sampling schemes and overcomes many of their shortcomings. In 
particular HMC methods are capable of proposing bold long distance moves 
in the state space which will retain a very high acceptance probability and 
thus improve the rate of convergence to the invariant measure of the chain and 
reduce the autocorrelation of samples drawn from the stationary distribution of 
the chain. The HMC proposal mechanism is based on sim ulating Hamiltonian 



dynamics defined by the target distribution, see Neal (|201ll) for a comprehensive 



tutorial. For this reason HMC is now routinely referred to as Hamiltonian 
Monte Carlo. Despite the relative strengths and attractive properties of HMC 
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it has largely been bypassed in the literature devoted to MCMC and Bayesian 
statistical methodology with very few serious applications of the methodology 
being published. 

More recently in Girolami and Calderhead it was established that 

HMC proposals are in fact local geodesic flows defined on the Riemann man- 
ifold of probability distributions. The Riemann manifold Hamiltonian Monte 
Carlo (RMHMC) methodology constructs geodesic based proposals implicitly 
via Hamiltonian dynamics on the manifold defined by the Fisher-Rao metric 
tensor and the corresponding Levi-Civita connection. The paper has raised an 
awareness of the differential geometric foundations of MCMC schemes such as 
HMC and has already seen a number of methodological and algorithmic de- 
velopments as well as some impressive and challengi ng ap plications expl oiting 
these geom etric MCMC meth ods (K onukoglu et al. 120111 Martin et al. l2012t 
Raue et al. 120121: Vanlier et al. l2012h . 

In contrast to Girolami and Calderhead ([20111 ). in this particular paper we 
show how Hamiltonian Monte Carlo methods may be designed for and applied to 
distributions defined on manifolds embedded in Euclidean space, by exploiting 
the existence of explicit forms for geodesies. By way of specific illustration we 
consider two such manifolds: the unit hypersphere, corresponding to the set of 
unit vectors in R d , and its extension to Stiefel manifolds, the set of p-tuples of 
orthogonal unit vectors in R d . 

Such manifolds occur in many statistical applications: distributions on cir- 
cles and spheres, such as the von Mises distrib ution, are common in problems 
dealing with directional data (Mardia and Jupp |2000l ). Orthono rmal b ases arise 
in dimension- reduction methods such as factor analysis (Jolliffe[l986), and can 
be used to construct distributions on matrices via eigendecompositions. 

The problem of sampling from such distributions has not received much 
attention. Most methods in wide use, such as those used in directional statistics 
for sampling from spheres, have been developed for the specific problem at 
hand, often based on rejection sampling techniques tuned to a specific family. 
For the various multivariate extensions of these distributions, these techniques 
are usually embedded in a Gibbs sampling scheme. 

There are relatively few works on the general problem of sam pling f rom man- 
ifolds. The recent paper by Diaconis, Holmes, and Shahshahani (2012) provides 
a readable introduction to the concepts of geometric measure theory, and prac- 
tical issues when sampling from manifolds, with the motivation of computing 
certain s amplin g distributions for hypothesis testing. Brubaker, Salzmann, and 
Urtasun (|2012n . somewhat similar to our approach, develop a HMC algorithm 
using the iterative algorithm for approximating the Hamiltonian paths. 

In the next section, we provide a brief overview of the necessary concepts 
from differential geometry and geometric measure theory, such as geodesies and 
Hausdorff measures. In section [3] we construct a Hamiltonian integrator that 
utilises the explicit form of the geodesies, and incorporate this into a general 
HMC algorithm. Section @] gives examples of various manifolds for which the 
geodesic equations are known, and section [5] provides some illustrative applica- 
tions. 
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2 Manifolds, geodesies and measures 



2.1 Manifolds and embeddings 

In this section, we introduce the necessary terminology from differential geom- 
etry and information geometry. A more rigo rous treatment can be fo und i n 
reference books such as do Carmo ( 1976L 1992 ) and Amari and Nagaoka ( 2000h . 



An m-dimensional manifold M. is a set that locally acts like R m : that is, for 
each point x G A4, there is a bijective mapping q, called a coordinate system, 
from an open set around x to an open set in R m . Our particular focus is on 
manifolds that are embedded in some higher-dimensional Euclidean space R™, 
(that is, they are submanifolds of R"). Note that R d is itself a d-dimensional 
manifold, which we refer to as the Euclidean manifold. 

Example 2.1. A simple example of an embedded manifold is the hypersphere or 
(d — l)-sphere: 

S^ 1 = {x G R d : \\x\\ = 1}. 

This is a (d — l)-dimcnsional manifold, as there exists an angular coordinate 
system G (0, 2n) x (0,7r) d ~ 2 where 

i-2 sin (f> n -u 
■j_2 COS 4> n -l, 





= sine 


h ■ 


. sin 


X2 


= sine 


h. 


. sin 


X3 


= sinr 


h. 


. cos 



x n -i = sm <pi cos 02, 
x n = cos 4>\ . 

Note that this coordinate system excludes some points of § d_1 : such as 5d — 
(0, . . . , 0, 1). As a result, it is not a global coordinate system (in fact, no global 
coordinate system for exists), nevertheless it is possible to cover all of § d_1 
by utilising multiple coordinate systems known as an atlas. 

A tangent at a point x € M. is a vector v that lies "flat" on the manifold. 
More precisely, it can be defined as an equivalence class of the set of functions 
{7 : [a, b] — > M : 7(io) = x} that have the same "time derivative" gjg(7(£))|t=t 
in some coordinate system q. For an embedded manifold, however, a tangent 
can be represented simply as a vector v G R™ such that 

v = i^) = ^T(*)| t=t0 - 



The tangent space is the set T x of such vectors, and form a subspace of R": this 

dxj 
dqj 



is the equal to the span of the set of partial derivatives of some coordinate 



system q. 

Example 2.2. A function on the sphere 7 : [a, b] — > must satisfy the 

constraint 53-s[7i(*)] 2 = 1- By taking the time derivative of both sides, we find 
that 



i d d 

a 



^£[7i(t)] 2 = 25>(*)7i(*)=0. 
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Therefore the tangent space at x £ E> d 1 is the (d — l)-dimensional subspace of 
vectors orthogonal to x: 

T x = {v £ R d : x T v = 0}. 

A Riemannian manifold incorporates a notion of distance, such that for a 
point q £ M, there exists a positive-definite matrix G, called the metric tensor, 
that forms an inner product between tangents u and v 

(u,v)g = u T G(q)v. 

Information geometry is the application of differential geometry to families 
of probability distributions. Such a family {p(- \ 9) : 9 G 6} can be viewed as a 
Riemannian manifold, using the Fisher-Rao metric tensor 



Gij — —E X \6 



d 2 

\ogp{X | 



89, 89 3 



Example 2.3. The family of <i-dimensional multinomial distributions 

p{z\6) = 0?....-9?, z = 5 1 ,...,6 d 
where Si is the ith coordinate vector, is parametrised by the unit (d— l)-simplex, 

A d_1 = {9 G R d : 0i >0,J2 6 i = 1 }- 

3 

This is a (d— l)-dimensional manifold embedded in R n , and can be parametrised 
in [d— 1) dimensions by dropping the last element of 9, the set of which we will 
denote by A?~^ . 

The Fisher-Rao metric tensor in A d ~^ is then easily shown to be 



: ^Li fl otherwise. 



A smooth mapping from a Riemannian manifold to W 1 is an isometric em- 
bedding if the Riemannian inner product is equivalent to the usual Euclidean 
inner product. That is 

s T G{q)t = u T v, where u, — ^ Jp-- S i> v i — X! 7T^' 



, ih l, ( '"A; 



or equivalently, 



^ 8 qi dq 3 

The existence of such embeddings is determined by the celebrated Nash (|l956h 
embedding theorem, however it doesn't give any guide as how to construct them. 
Nevertheless, there are some such embeddings we can identify. 
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Figure 1: Unit 2-simplex A 2 and the positive orthant of the 2-sphere § 2 . The 
lines on the simplex are equidistant: the transformation to the sphere stretches 
these apart near the boundary. 



Example 2.4. There is a bijective mapping from the simplex A d_1 to the positive 
orthant of the sphere § d_1 by taking the element- wise square root n — \/Wi (see 
Figured]). If we consider it as a mapping from A^~^, the partial derivatives are 
of the form 

dxi f K 1/2 if » = i< d, 

ii(l-Etx^)- 1/2 tfi = / = 4 

Note that by ([XJ) , this is an isometric embedding (up to proportionality) of the 
Fisher Rao metric from Example 12.31 



2.2 Geodesies 

The affine connection of a manifold determines the relationship between tangent 
spaces of different points on a manifold: interestingly, this depends on the path 
7 : [a, b] — >■ A4 used to connect the two points, and for a vector field v(t) G T^ft) 
along the path, we can measure the change by the covariant derivative. 

Of course, the time derivative j(t) = ^-jp- is itself a such a vector field: 
when this follows the affine connection, the covariant derivative is 0, in which 
case 7 is known as a geodesic. 

This property can be expressed by the geodesic equation 

7i(*)+Eli*(7(*))7 J -(*)^(*)=0 ) (2) 
i,k 

where Tj k (x) are known as the connection coefficients or Christoffel symbols. 
A Riemannian manifold induces a natural affine connection known as the Levi- 
Civita connection. 

In the Euclidean manifold W 1 , the Christoffel symbols Tj k are zero, and so 
the geodesic equation ((2J reduces to j(t) — 0. Hence the geodesies are the set 
of straight lines y(t) =at + b. 

In a Riemannian manifold, the geodesies are the locally extremal paths (max- 
ima or minima in terms of calculus of variations) of the integrated path length 

b 

\m)\\ G dt, where \\v\\% = v T Gv. 
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Figure 2: A geodesic (•—>■) and great circle ( ) on the sphere § 2 , and its path 

in the spherical polar coordinate system x — (sin^i sin(/>2,sin0i cos 4>2, cos 4>\). 
The ellipses correspond to equi- length tangents from each marked point. 



Moreover, the geodesies have constant speed, in that ||7(£)||g is constant over 
t. As the geodesies can be determined by the metric, they are consequently 
preserved under any metric-preserving transformation, such as an isometric em- 
bedding. 

Example 2.5. A standard result in differential geometry is that the geodesies of 
the n-sphere are rotations about the origin, known as great circles (see Figure[2]) : 

x(t) = x(0) cos(at) + sin(crf) 
a 

where x(0) G S™ is the initial position, v(0) is the initial velocity in the tangent 
space (i.e. such that x(0) T v(0) = 0), and a — ||w(0)|| is the constant angular 
velocity. 

For any geodesic 7 : [a, b] — >• M., the geodesic flow describes the path of 
the geodesic and its tangent (7(4), 7(f)). Moreover, it is unique to the initial 
conditions (x,v) = (7(a), 7(a)), so we can describe any geodesic flow from its 
starting position x and velocity v: this is also known as the exponential map. 
If all such pairs (x,v) describe geodesies, the manifold is said to be geodesically 
complete, which is true of the manifolds we consider in this paper. 



2.3 The Hausdorff measure and distributions on manifolds 

As our motivation is to sample from distributions defined on manifolds, we 
introduce some basic concepts of geometric measure theory that will be useful 
for this purpose. Geometric measure theory is a large and active topi c, an d 
is covered in detail in references such as Federer (1969) and Morgan (2009). 



However for a more accessible overview with a statistical flavour, w e sug gest 



the recent introduction given by Diaconis, Holmes, and Shahshahani (|2012f ). 

Our key requirement is a reference measure from which we can specify prob- 
ability density functions, similar to the role played by the Lebesgue measure for 
distributions on Euclidean space. For this we use the Hausdorff measure, one of 
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the fundamental concepts in geometric measure theory. This can be defined rig- 
orously in terms of a limit of coverings of the manifold (see the above references), 
however for a manifold embedded in R™, it can be heuristically interpreted as 
the surface area of the manifold. 

The relationship between H™ 1 , the m-dimensional Hausdorff measur e, and 
A™ 1 , the Lebesgue measure on R m , is given by the area formula (Federer 1969, 
Theorem 3.2.5). If we parametrise the manifold by a Libschitz function / : 
Rm M ,i 5 then for any -H^-measurable function g : R n -> R, 



</(/(«)) J m f{u) \ m {du) = / g{x) \{ueA: /(«) = x} H m (dx). 

A J R n 

Here J m f(x) is the m-dimensional Jacobian of /: this can b e defined as a 
norm on the matrix of partial derivatives Df{x) fFederer Il969l Section 3.2.1), 
and if rankD/(x) = m, then [J m f(x)] 2 is equal to the sum of squares of the 
determinants of all m x m sub-matrices of Df{x). 

Example 2.6. The square root mapping in Example 12.41 from ^Zd) to has 
(d — l)-dimensional Jacobian 



1 d 

2 d— nx 1 2 - 



=i 

The Dirichlet distribution is a distribution on the simplex, with density 

1 d 

v ' i=i 

with respect to the Lebesgue measure on ■ Therefore, the corresponding 

density with respect to the Hausdorff measue on S d_1 is 

2Qi-l 



B(a) 



n 

i=l 



X ■ 



In other words, the uniform distribution on the sphere arises when on = |, 
whereas a = 1 gives the uniform distribution on the simplex (see Figure . 

The area formula allows th e Hau sdorff measure to be easily extended to 
Riemannian manifolds (Federer 1 19691 Section 3.2.46), where 



H m (dq) = ^\G{q)\X m (d q ). 

This construction would be familiar to Bayesian statisticians as the Jeffreys 
prior, in the case where G is the Fisher-Rao metric. 

When working with probability distributions on manifolds, the Hausdorff 
measure forms the natural reference measure, and allows for reparametrisation 
without needing to compute any additional Jacobian term. We use itu to denote 
the density with respect to the Hausdorff measure of the distribution of interest. 

Example 2.7. The von Mises distribution is a c ommon family of distributions de- 
fined on the unit circle (Mardia and Jupp 200Ct section 3.5.4). When parametrised 



by an angle 9, the density with respect to the Lebesgue measure on [0, 2tt) is 

2tt1 {k) 
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Figure 3: Densities of different beta(a, a) distributions for u £ (0, 1) (left), and 
their corresponding transformations to the positive quadrant of the unit circle 

§ , by the mapping u h-> (yl — u, y/u) (right), a = 0.1 ( ), a = 0.5 ( ), 

and a — 1.0 ( ). 



The embedding transformation x = (sin 6, cos 9) has unit Jacobian, so the den- 
sity with respect to the 1-dimensional Hausdorff measure is 

wn ^> = o t\\ ih e *v{c T x}, 
27rJo(||q|) 

where c = (k sin//, kcos/z) and is the modified Bessel function of the first 
kind. In other words, it is a natural exponential family on the circle. 

The von Mises-Fisher distrib ution is the natural extension to higher-order 
spheres (Mardia and Jupp 2000l section 9.3.2), with density 



ll c ll p/2_1 

ku{x) = , -r-fj-exp-tc x}. 

Attempting to write this as a density with respect to the Lebesgue measure 
on some parametrisation of the surface, such as angular coordinates, would be 
much more involved, as the Jacobian is no longer constant. 



3 Hamiltonian Monte Carlo on embedded man- 
ifolds 

Riemannian manifold Hamiltonian Monte Carlo (RMHMC) is a Markov chain 
Monte Carlo (MCMC) scheme where new samples are proposed by approxi- 
mately solving a system of differential equations describing t he pa ths of Hamil- 
tonian dynamics on the manifold (Girolami and Calderhead 1201 11 ) . 

The key requirement for Hamiltonian Monte Carlo is the symplectic integra- 
tor. This is a discretisation that approximates the Hamiltonian flows, yet main- 
tains certain desirable properties of the exact solution, namely time-reversibility 
and volume preservation. The standard approach is to use a leapfrog scheme, 
which alternately updates the position and momentum via first order Euler 
updates rNeal l201lh . 
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Given a target density ir(q) (with respect to the Lebesgue measure) in some 
coordinate syste m q, R iemannian manifold Hamiltonian Monte Carlo, Girolami 
and Calderhead ([20111 ) utilises a Hamiltonian of the form 



H(q,p) = - logTrfo) + - log \G(q)\ + ~p T G(q)- l p, 

where G is the metric tensor. This is the negative log of the joint density 
(respect to the Lebesgue measure) for (q,p), where the conditional distribution 
for the auxiliary momentum variable p is Af(0, G(q)). 

The first two terms can be combined into the negative log of the target 
density with respect to the Hausdorff measure of the manifold 

H(q,p) = - logir n (q) + \p T G{q)- l p. (3) 

By Hamilton's equations, the dynamics are determined by the system of differ- 
ential equations 

dq dH i 

Tt = = G{q) p ' (4) 

| = ~ = V,pog*tt(ff) - \p T G{q)^p\. (5) 

As this Hamiltonian is not separable (that is, it cannot be written as the sum 
of a function of q and a function of p) , we are unable to apply the standard 
leapfrog integrator. 



3.1 Geodesic integrator 

Girolami and Calderhead (|201lh develop a generalised leapfrog scheme, which 
involves composing adjoint Euler approximations to (j?]) and ([5]) in a reversible 
manner. Unfortunately, some of these steps do not have an explicit form, and 
so need to be solved implicitly by fixed-point iterations. Furthermore, these 
updates require computation of both the inverse and derivatives of the metric 
tensor, which are 0(m 3 ) operations; this limits the feasibility of numerically 
naive implementations of this scheme for higher-dimensional problems. Finally, 
such a scheme assumes a global coordinate system, which may cause problems 
for manifolds for which none exist, such as the sphere, where artificial boundaries 
may be induced. 

In this contribution we instead construct an integrator by splitting the Hamil- 



tonian (Hairer, Lubich, and Wanner 2006, section II. 5): that is, we treat each 



term in ([3]) as a distinct Hamiltonian, and alternate simulating between the 
exact solutions. 

Splitting methods have been used in other c ontext s to develop alternative 
integrators for Hamiltonian Monte Carlo (Neal 1 2 1 lL section 5.5. 1), su ch as 
extending HMC to infinite-dimensional Hilbert spaces (Beskos et al. 2011), and 
defining schemes that may reduce computational cost (Shahbaba et al 



20111) . 



We take the first component of the splitting to be the "potential" term 
HW(q,p) = -log G ir(q). 
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Hamilton's equations give the dynamics 

9 = _ ^r p = — £hT = Vi ? log « 7r( ^- 

Starting at (g(0), p(0)), this has the exact solution 

q(t) = q(0) and p(f) = p(0) + tV, lo g 7r K (<z)| 9=(?(0) . (6) 

In other words, this is just a linear update to the momentum p. 
The second component is the "kinetic" term 

HW{q,p) = \p T G{q)- l p, (7) 

This is simply a Hamiltonian absent of any potential term, the solution of Hamil- 
ton's equations can be easily shown to be a geo desic flow under the Levi-Civita 
connection of G (Abraham and Marsden 19781 Theorem 3.7.1), or to be more 
precise, a co-geodesic flow (q(t),p(t)), where pit) = G(q(t))q(t). 

Our integrator is then constructed by alternately simulating from the dy- 
namics of 7? W and for some time step e. Each iteration of the integrator 
consists of the following steps, starting at position (q,p) in the phase space 

1. Update according to the solution to in ©, for a period of e/2 by 
setting 

P + |V 9 l0g7T«((j) (8) 

2. Update according to H^ 2 \ by following the geodesic flow starting at (q,p), 
for a period of e. 

3. Update again according to #W for a period of e/2 by © . 

As with the RMHMC algorithm, the metric G need only be known up to pro- 
portionality: scaling is equivalent to changing the time step e. 

The solutions to H W and are reversible and symplectic, as they are 
themselves are Hamiltonian systems. The integrator is constructed by com- 
position of these solutions in a symmetric manner, so it is also reversible and 
symplectic. 

Therefore the overall transition kernel for our Hamiltonian Monte Carlo 
scheme from an initial position go, is as follows 

1. Propose an initial momentum po from N(0, G(qo)) . 

2. Map (<7o,Po) ^ (qT,Pr) by running T iterations of the above integrator. 

3. Accept the qx as the new value with probability 

1 A exp{-Hiq T ,p T ) + H(q ,p )}, 
otherwise return the original value go- 



lf) 



3.2 Embedding coordinates 

The algorithm can also be written in terms of an embedding, which avoids 
altogether the computation of the metric tensor and the possible lack of a global 
coordinate system. 

Given an isometric embedding £ : M. — > R™, then the path x(t) — £(q(t)), 
such that 

3 J 

Therefore, we can transform the phase space (q,p), where q = G _1 p, to the 
embedded phase space {x, v), such that 

v = ± = MG(q)~ 1 p = M(M T M)~ 1 p where U, = 

oqj 

since G = M T M, from (fTJ). 

By substitution, the Hamiltonian ((3]) can be written in terms of these coor- 
dinates as 

H=-\ogn n (x) + ±v T v. (9) 

Note the target density tt-h is still defined with respect to the Hausdorff measure 
of the manifold, and so no additional log-Jacobian term is introduced. 

We can rewrite the solution to ifW in in these coordinates. The position 
x(t) remains constant, and by the change of variables the operator V g = M T V X , 
the velocity has a linear path 

v(t) = v(0) + tM(M T MyH^Vx log Tr n (x)\ x=x{0) . 

The linear operator M (M T M M T is the "hat matrix" from linear regression: 
this is the orthogonal projection onto the span of the columns of M, i.e. the 
tangent space of the embedded manifold. 

Although it is possible to compute this projection using standard least 
squares algorithms, it can be computationally expensive and prone to numerical 
instability at the boundaries of the coordinate system (for example, at the poles 
of a sphere). However for all the manifolds we consider, there exists an explicit 
form for an orthonormal basis N of the normal to the tangent space, in which 
case we can simply subtract the projection onto the normal: 

v(t) = »(0) + t(I - NN T )V X log7r w (x)| B=je(0) . 

Finally, we require a method for sampling the initial velocity vq. Since 
Po ~ Af(Q, G(q)), it follows that 

v - JV(0, M(M T M)- 1 M T ) = Af(0, 1 ~ NN T ). 

We don't need to compute a Cholesky decomposition here: since (7 — NN T ) 
is a projection, it is idempotent, so we can draw z from 7V(0, I n ), and project 
vq = (I — NN T )z to obtain the necessary sample. 

The resulting procedure is presented in Algorithm [TJ In order to implement 
it for an embedded manifold M. C R™, we need to be able to evaluate the 
following at each x e M: 
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Algorithm 1 The transition kernel for Hamiltonian Monte Carlo on an embed- 
ded manifold using geodesic flows, 
l: v~Af(0,I n ) 

2: V <- V - N(x)N(x) T V 
3: h 4— log7T^(x) — \ vTv 
4: X* 4- X 

5: for r = 1,. . . ,T do 

6: v <- v+ fV x . logir-uix*) 
7: v 4- v - N(x)N(x) T v 

8: Update (x*,v) by following the geodesic flow for a time interval of e 

9: v <- v+ fVa- logir-uix*) 
10: V 4- v - N(x)N(x) T v 
11: end for 

12: /l* log7T^(x*) - \v T V 
13: U~W(0,1) 

14: if u < exp(/i* — /i) then 
15: x <- a;* 
16: end if 



• The log-density with respect to the Hausdorff measure log7r^, and its 
gradients. 

• An orthogonal projection from E™ to the tangent space of x G M 

• The geodesic flow from any v € T X M. 

4 Embedded manifolds with explicit geodesies 

In this section, we provide examples of embedded manifolds for which the ex- 
plicit forms for the geodesic flow is known, and derive the bases for the normal 
to the tangent space. 

4.1 Affine subspaces 

If the embedded manifold is flat, e.g. an affine subspace of R™, then the geodesic 
flows are the straight lines 

[x(t),v(t)] = [x(0),v(0)] 

In the case of the Euclidean manifold R™ , then the normal space to the tangent 
is null and no projections are required, and hence the algorithm reduces to the 
standard leapfrog scheme. 

In standard HMC, it is common to utilise a "mass" or "preconditioning" 
positive-definite matrix M, essentially a constant Riemannian metric, in order 
to reduce the correlation between samples, especially where variables are highly 
correlated, or have different scales of variation. This is equivalent to our proce- 
dure on the embedding of x = L T q, where L is a matrix square-root such that 
LL T = M. 



1 
t 1 
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4.2 Spheres 

Recall from earlier examples, the unit (d— l)-sphere S d_1 is an (d— l)-dimensional 
manifold embedded in M. d , characterised by the constraint 

x T x = 1, 



with tangent space 

{v G M. d : x T v = 0}. 

Distributions on spheres, particularly S 1 an d § 2 , arise in many problems in 
directional statistics (Mardia and Jupp 2000f) : examples include the von Mises- 
Fisher distribution (Example 12 .71) . and the Bingham- von Mises-Fisher distribu- 
tion fsection l5.1[) . For many of these distributions, the normalisation constants 
of the density functions are often computationally intensive to evaluate, which 
makes Monte Carlo methods particularly attractive. 

As mentioned in Example [531 the geodesies of the sphere are the great circle 
rotations about the origin. The geodesic flows are then 



[x(t),v(t)} = [x(0),v(0)] 



1 

a- 1 



cos(at) 
sin(ai) 



— sin(crf) 
cos(ai) 



1 

a 



(10) 



where a = \\v(t)\\ is the (constant) angular velocity. The normal to the tangent 
space at x is x itself, so (I — xx T )u is an orthogonal projection of an arbitrary 
u G M. d onto the tangent space. 

Other than the evaluation of the log-density and its gradient, the compu- 
tations only involve vector-vector operations of addition and multiplication, so 
the algorithm scales linearly in d. 

4.3 Stiefel manifolds 

A Stiefel manifold V^p is the set of d x p matrices X such that 



X T X = I. 



In other words, the set of matrices with orthonormal column vectors, or equiv- 
alently, the set of p-tuples of orthogonal points in It is a [dp — \p(j> + 1)]- 

dimensional manifold, embedded in R dxp . In the special case where d = p, the 
Stiefel manifold is the orthogonal group Od- the set of d x d orthogonal matrices. 

These arise in the statistical problems related to dimension reduction such 
as factor analysis and principal component analysis, where the aim is to find 
a low dimensional subspace that represents the data. They can also arise in 
contexts where the aim is to identify orientations, such as projections in shape 
analysis, or the eigendecomposition of covariance matrices. 

Again, we can find the constraints on the phase space by the time-derivative 
of the constraint for an arbitrary curve X(t) in V^p 

±[X(t) T X(t)} = X(t) T X(t) + X(t) T X(t) = 0. 

That is, the tangent space at X is the set 

{V G M dxp : V T X + X T V = 0}. 
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If we let x denote the matrix X written as a vector in R p by stacking the 



columns 



Xi, . 



then an orthonormal basis N for the normal to the tangent 



space has p vectors of the form 



Xl 




"0" 




"0" 





1 


X 2 


J ... J 





_0_ 




_0_ 




_ Jp_ 



and 



(2) vectors of the form 











" " 











7! x 3 









5 


71 X2 
















the projection on to the tangent space is 



For an arbitrary vector u € 
then 

Ml — X 1 (xjui) — ^X 2 (xju 2 + xjui) 

u - NN T u = u 2 - x 2 {xju 2 ) - ^xi(xjui + xju 2 ) 



This can be more easily written in matrix form: for an arbitrary U € R dxp , the 
orthogonal projection onto the Stiefel manifold is 



U - -X(X T U + U T X). 

The geodesic flows are more complicated than the spherical case. For p > 1, 
they are no longer simple rotations, but can b e expressed in terms of matrix 
exponentials (Edelman, Arias, and Smith 19991 page 310) 



[X(t),V(t)} = [X(Q),V(0)}exp\t 



-S(0) 
A 



exp{-tA} 

exp{-tA} 



where A = X(t) T V(t) is a skew-symmetric matrix that is constant over the 
geodesic, and S(t) — V(t) T V(t) is non-negative definite. 

Although matrix exponentials can be quite computationally expensive, we 
note that the largest exponential of these is of a 2p x 2p matrix. Other than 
the matrix exponential and the evaluations of the log-density and its gradients, 
all the other operations are simple matrix additions and multiplications, the 
largest of which can be done in 0(dp 2 ) operations, hence the algorithm scales 
linearly with d. 

For the ortho gonal group the geodesies have the simpler form (Edelman, 
Arias, and Smith 19991 equation 2.14) 



[X(t),V(t)] = [X(0),V(0)} 



enp{tA} 
exp{tA} 



14 



4.4 Product manifolds 

Given two manifolds Mi and M 2 , their cartesian product 

Mi x M 2 = {(xi,x 2 ) : xi e Mi,x 2 £ M 2 ) 
is also a manifold. 

Product manifolds arise naturally in many statistical problems, for example 
extensions of the von Mises distributions to S 1 x S 1 (a torus) have been used to 
model molecular angles (Singh, Hnizdo, and Demchuk 120021 ) . and the network 
eigenmodel in section 15731 has a posterior distribution on V miP x t p x 1. 

The geodesies of a product manifold are of the form (71 ,72), where each 7, is 
a geodesic of Mi- Likewise, the tangent vectors are of the form (vi, v 2 ), where 
each Vi is a tangent to Mi- Consequently, for an arbitrary vector (ui,u 2 ), the 
orthogonal projection onto the tangent space is ([J — NiN^]ui, [I — N 2 Nj]u 2 ), 
where iVj is an orthonormal basis of Mi. 

As a result, when implementing our geodesic Monte Carlo scheme on a prod- 
uct manifold, the key operations (addition of gradient, projection, and geodesic 
update) can be essentially be done in parallel, the only operation requiring 
knowledge of the other variables being the computation of the log-density and 
its gradient. Moreover, when tuning the algorithm, it is possible to choose differ- 
ent e values for each constituent manifold, which can be helpful when variables 
have different scales of variation. 



5 Illustrative examples 

5.1 Bingham— von Mises— Fisher distribution 

The Bingham-von Mises-Fisher (BVMF) distribution is the exponential family 
on S''" 1 with linear and quadratic terms, with density of the form 

tt-h(x) oc exp{c T x + x T Ax}, 

wher e c is a vector of length d, and A is a d x d symmetric matrix (Mardia and 
Jupp |2000l section 9.3.3). 



The Bingham distribution arises as the special case where c = 0: this is an 
axially-bimodal distribution, with the modes corresponding to the eigenvector 
of the largest eigenvalue. The BVMF distribution may or may not be bimodal, 
dependi ng on the parameter values. 

Hoff (<2009h develops a Gibbs-style method for sampling from BVMF distri- 



bution by first transforming y = E T x, where E T AE is the eigen-decomposition 
of A. Each element yi of y is updated in random order, conditional u £ S d_2 , 
where Uj = yjj y/ 1 — yf for j 7^ i. The yi | u are sampled using a rejection sam- 
pling scheme with a beta envelope, however as noted by Brubaker, Salzmann, 



and Urtasun (|2012l ). this can give exponentially poor acceptance probabilities 
(of the order of 10~ 100 ) for certain parameter values, particularly when c is large 
in the direction of the negative eigenspectra. 

Implementing our geodesic sampling scheme for the BVMF distribution is 
straightforward, as the gradient of the log-density is simply c + 2 Ax, and ex- 
tremely fast to run, with run times that are independent of the parameter values. 
However, as with any gradient-based method, it has difficulty switching between 
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ci = ci = 20 ci = 40 



Figure 4: Trace plots of X4 from 1000 samples from the spherical geodesic 
Monte Carlo sampler (with parameters e = 0.01, T — 20) for a Bingham-von 
Mises-Fisher distribution, with parameters A — diag(— 10, — 5, 5, 10) and c = 
(ci, 0,0,0). When the distribution is bimodal (ci = 0,20), the sampler has 
difficulty moving between the modes. 



the multiple modes (see FigureH]). Tempering methods (Neal l2011 , section 5.5.7) 
can be used to alleviate this problem, and could be easily incorporated into this 
scheme. 



5.2 Non-conjugate simplex models 

We can use the transformation to the sphere to sample from distributions on the 
simplex A d_1 . These arise in many contexts, particularly as prior and poste- 
rior distributions for discrete- valued random variables such as the multinomial 
distribution. 

If each observation x from the multinomial is completely observed, then the 
contribution to the likelihood is then 6 X , giving a full likelihood of at most d 
terms of form 



L(e) = U 9 



which is conjugate to a Dirichlet prior distribution. 

Complications arise if observations are only partially observed. For example 
we may have marginal observations, which are only observed to a set S, in 
which case the likelihood term is X^es^si or conditional observations, where 
the sampling was constrained to occur within a set T, with likelihood term 
x /(J2teT These terms destroy the conjugacy, and make computation very 
difficult. 

Such m odels a rise under a case-cohort design (Le Polain de Waroux, Maguire, 



and Moren |2012j) , the risk factors of a particular disease: for the case sample, 
the risk factors are observed conditional on the person having the disease, and 
for the cohort sample the risk factors are observed marginally (as disease sta- 
tus is unknown), and overall population statistics may provide some further 
information as to the marginal probability of the di sease. 

The hyperdirichlet R package (Hankin [2010) provides an interface and 
examples for dealing with this type of data. We consider the volleyball data 
from this package: the data arises from a sports league for 9 players, where each 
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a = 


0.1 a = 


= 0.5 


a ■ 


= 1.0 


a ■ 


= 5.0 




ESS % 


ESS/s ESS % 


ESS/s 


ESS % 


ESS/s 


ESS % 


ESS/s 


RW-MH 


0.0045 


4.15 0.099 


58.9 


0.34 


144 


0.85 


284 


Simplex HMC 


0.0021 


0.0040 0.028 


0.068 


53.7 


585 


51.3 


638 


Spherical HMC 


0.0164 


0.265 76.7 


1254 


85.0 


1376 


91.7 


1499 



Table 1: Average effective sample size (ESS) across coordinates per 100 sam- 
ples, and per second, of the Volleyball model under a Dirichlet(al) prior from 
1000 000 samples. For all samplers, e = 0.01, and for the HMC algorithms, 
T = 20 leapfrog steps were used. 

match consists of two disjoint teams of players, one of which is the winner. The 
probability of a team T\ beating T2 is assumed to be 

StgTx Pt 
J2t£T 1 UT 2 Pt 

where p = (pi, . . . ,pg) € A 8 . We compare three different methods in sampling 
from the posterior distribution for p under a Dirichlet(al) prior, for different 
values of a. The results are presented in Table [1] 

The first is a simple random- walk Metropolis-Hastings algorithm. To ensure 
the planar constraint y'., pi = 1 is satisfied, the proposals are made from a 
degenerate Af(x, e 2 [I — nn ]), where n — d _1 / 2 l is the normal to the simplex. 

The second is the geodesic Monte Carlo algorithm on the simplex. We can 
ensure the planar constraint is satisfied via the affine constraint methods in 
14.11 however we need to further ensure that the integration paths satisfies the 
positivity constraints, which can be achieved by reflecting the path whenever it 
violates the constraint. See the appendix for further details. 

The third is the geodesic scheme based on the square root transformation 
to the sphere from Example 12.41 Although this only defines the distribution 
on the positive orthant, we can extend this distribution to the entire sphere by 
reflecting about the axes (since we only require knowledge of the density up 
to proportionality, we can ignore the fact that it is now 2 d times larger). One 
benefit of this transformation is that the surface is now smooth, so no reflections 
are required. 

For small values of a, both geodesic Monte Carlo samplers perform poorly, 
due to the concentration of the density at the boundaries. These peaks cause 
particular problems for the Hamiltonian-type algorithms, as the discontinuous 
gradients mean that the leapfrog paths give poor approximations to the true 
Hamiltonian paths, resulting in poor acceptance probabilities. Moreover, for 
the algorithm on the simplex, the frequent reflections add to the computational 
cost. 

However when a = 0.5, the spherical sampler improves markedly: recall 
from Example 12.61 and Figure [3] that the Dirichlet(0.5) prior is uniform on the 
sphere, giving continuous gradients. On the simplex however, the density re- 
mains peaked at the boundaries. The simplex sampler improves considerably 
for values of a > 1 (where the gradient is now flat or negative), however the 
spherical algorithm still retains a slight edge. 
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200 



100 - 



-100 - 



Figure 5: Trace plots of two chains of 1000 samples of the diagonal elements of A 

from geodesic Monte Carlo sampler o n the n etwork eigenmodel. One chain ( ) 

converges to the same mode as Hoff (2009j), while the other ( ) converges to 

a local mode, with approximately 10~ 36 of the density. 



5.3 Eigenmodel for network data 

We use the network eig enmodel of Hoff ((2009) to demonstrate how Stiefel mani- 



fold models can be used for dimension reduction, and how our geodesic sampling 
scheme may be used for Stiefel and product manifolds. This is a model for a 
graph on a set of m nodes, where for each unordered pair of nodes {i, j}, there 
is a binary observation Yujy indicating the existence of an edge between i and 

The specific example of Hoff (2009) is a protein interaction network, where 



for m — 270 proteins, the existence of the edge indicates whether or not the 
pair of proteins interact. 

The model represents the network by assuming a low {p = 3) dimensional 
representation for the probability of an edge 

P(Y {iJ} =l) = $([[/A[/ T ] 4 ,+c), 

where <f> : M — > (0, 1) is the probit link function, U is an orthonormal m x p 
matrix, and A is a p x p diagonal matrix. U is assumed to have a uniform prior 
distribution on V m . p (with respect to the Hausdorff measure), the diagonal 
elements of A have a 7V(0, m) distribution, and c ~ Af(0, 10 2 ). 

Hoff ( 2009f ) sample from the posterior by introducing latent Gaussian vari- 



ables for each edge: conditional on these, the parameters U have a matrix- valued 
Bingham-von Mises-Fisher distribution, which can be sampled using an exten- 
sion of the usual vector- valued sampler mentioned in section 15.11 This can be 
incorporated into a Gibbs-style scheme for sampling all the parameters. 

We implement geodesic Monte Carlo on the product manifold, details of 
which are given in the appendix. Trace plots from two chains of the diagonal 
elements of A appear in Figure [S] note that one chain appears to get stu ck in a 
local mode, while the other converges to the same as the method of Hoff (|2009l) . 
Also, there is no switching between modes, despite their exchangeability. Again, 
it may be possible to remedy this with a tempering scheme. 
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6 Conclusion and discussion 



We have presented a scheme for sampling from distributions defined on mani- 
folds embedded in Euclidean space by exploiting their known geodesic structure. 
This method has been illustrated using applications from directional statistics, 
discrete data analysis and network analysis. 

The major constraint of this technique is the requirement of an explicit 
form for the geodesic flows that can be easily evaluated numerically. These are 
not often available, for instance the geodesic paths of ellipsoids require often 
computationally intensive elliptic integrals. 

Of the examples we consider, the geodesies of the Stiefel manifold case are 
the most difficult. Although we do not use them in our examples, the skew- 
symmetric struc ture o f A allows special r outine s to be used for its exponential 
(Gallier and Xu l2002t Cardoso and Leite l2010h . and it is possible that similar 
routines may also exist for the larger matrix exponentiation. An alternative ap- 
proach would be to utilise a Metropolis-within-Gibbs style scheme over subsets 
of columns, for example by updating a pair of columns such that they remain 
orthogonal to the remaining columns. 

However once the geodesies and the orthogonal tangent projection of the 
manifold are known, the remaining process of computing the derivatives is 
straightforward, and could be easily implemented using automatic differenti- 
ation, tools, as is used in the Stan MCMC library (Stan Development Team 



20121 ). currently under development. 



Our approach could be widely applicable to problems in directional statistics, 
such as the estimation of normalisation constants which are often otherwise 
numerically intractable. More generally, it could be easily incorporated as part 
of a larger method for fitting distributions on sph eres, a s part of an exact- 
approximate sampling scheme ( Andrieu and Roberts l2009h . 

The method of transforming the simplex to the sphere could be useful for 
applications dealing with high-dimensional discrete data, such as statistical ge- 
netics and language modelling. Stiefel manifolds arise naturally in dimension 
reduction problems, our methods could be particularly useful where the data 
are not normally distributed, for instance the analysis of survey data with dis- 
crete responses, such as Likert scale data. Furthermore, this method could be 
utilised in statistical shape and image analysis for determining the orientation 
of objects in projected images. 



Acknowledgements 

Simon Byrne is funded by a BBSRC grant (BB/G006997/1) and an EPSRC 
Postdoctoral Fellowship (EP/K005723/1). Mark Girolami is funded by an EP- 
SRC Established Career Fellowship (EP/J016934/1). 

Contact address 

Email: simon.byrne@ucl.ac.uk 

Postal address: Department of Statistical Science, University College London, 
Gower Street, London WC1E 6BT, United Kingdom. 



19 



A Reflecting at boundaries of the simplex 



Neal (|201l[ section 5.5.1.5) notes that when an integration path crosses a bound- 
ary of the sample space, it can be reflected about the normal to the boundary to 
keep it within the desired space. He considers boundaries that are orthogonal 
to the ith axis, with normals of the form Si. Whenever such a constraint is 
violated, the position and velocity are replaced by 

x\ = bi + (bi — x{) and v[ — — Wj, 

where bi is the boundary (either upper or lower). As no other coordinates 
are involved in this reflection, this can be done in parallel for all constrained 
coordinates. 

However for the simplex A d_1 , the normals are not of the form Si, as this 
would result in the path being reflected off the plane {x : J2i x i = !}■ Instead, 
we need to reflect about the projection of Si onto the plane, that is 

"'' V; 1 where m = S { - ((T 1 / 2 !)^- 1 ^!)^.. 



IKII y/d(d - 1) 

A procedure for performing the position updates is given in Algorithm [2] 



Algorithm 2 The geodesic updates on the simplex incorporating reflection off 

the boundaries. 

1: u) <— e 

2: while cj > do 

3: (s,I) <— (min, argmhij){— Xi/vi : Vi < 0} {The time until any coordinate 

is negative-valued: this can only occur when the velocity is negative.} 
4: x ^— x + min(o;, n)v 
5: lj ui — min(u;, k) 
6: if uj > then 
7: v <— v — 2njfijv 
8: end if 
9: end while 



B Network eigenmodel 

Define the p x p symmetric matrices r/ = UAU T + c and Y * , where 

{ 1 Y { i,j} = 

then using the property that 1 — $(a;) = $(— x), the log-density of the posterior 
is 

p a 2 c 2 

logn n (U, A,c) = l °S®( Y *jVij) - 2^ ~ 200 + constant - 
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The gradients with respect to the parameters are 



log 7T% 

dUi r 

log 7T% 

dA rr 
dc 

where the gradients with respect to the linear predictors are 

%i ~ ij HYi*Vi.iY 

The program was implemented in MATLAB. To avoid numerical overflow 
errors, the ratio <fi(x)/<S>(x), as well as log$(a;) for negative values of x, are 
calculated using the erf cx function. The matrix exponential terms were calcu- 
lated using the inbuilt expm function, which utilises a Pade approximation with 
scaling and squaring. 

Different e values were used for each parameter: ejj = 0.005, e\ = 0.1 and 
e c = 0.001. T = 20 leapfrog steps were run for each iteration. 



E 9 fog 7T-H j T TJ Arr 



dr)i_ 



8 log TTfi c 

~dru j 100' 



E 

{ij} 
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